Text Extraction

The text extraction feature lets you pull the text out of a PDF document. Text can be extracted from an entire PDF document (using the getText method of the PdfDocument class) or from a single page of a PDF (using the getText method of the PdfPage class). In both cases, getText returns the text as a String.

Keep the following points in mind when using the getText method to extract text from a PDF:

Text that is part of an image, a form field or a note/comment will not be extracted.
Text will be extracted in the order in which the PDF operators appear in the existing PDF, which may differ from the visual reading order of the page.
During evaluation mode, text extraction is limited to 256 characters.
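
As a rough illustration of the evaluation-mode limit above, the sketch below flags extracted text that may have been cut off. It uses only the standard library and is not part of the product's API: the class name ExtractionChecks and the method mayBeTruncated are hypothetical, and it assumes the limit shows up as a returned string of 256 characters, which the documentation does not state explicitly.

```java
class ExtractionChecks {
    // Evaluation-mode cap taken from the list above.
    static final int EVALUATION_LIMIT = 256;

    // Hypothetical helper: returns true when the text is at least as long as
    // the cap, so it may have been truncated (assumption, not documented behavior).
    static boolean mayBeTruncated(String extractedText) {
        return extractedText != null && extractedText.length() >= EVALUATION_LIMIT;
    }

    public static void main(String[] args) {
        // Placeholder strings standing in for the result of getText().
        String shortText = "Hello PDF";
        String longText = "x".repeat(EVALUATION_LIMIT);
        System.out.println(mayBeTruncated(shortText));
        System.out.println(mayBeTruncated(longText));
    }
}
```

A result that reaches the cap is only a hint: a document whose text really is exactly that long would be flagged as well.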

The following code will extract the text in an existing PDF document.

[Java]
    // Create the PDF document object
    PdfDocument pdfA = new PdfDocument("[PhysicalPath]/MyDocument.pdf");
    // Call the getText method on the PDF document object to get the text of the whole document
    String extractedText = pdfA.getText();

The following code will extract the text from a specified page within a PDF.

[Java]
    // Create the PDF document object
    PdfDocument pdfA = new PdfDocument("[PhysicalPath]/MyDocument.pdf");
    // Call the getText method on a PDF page to get the text from that page
    String extractedText = pdfA.getPages().getPdfPage(1).getText();
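
Because getText returns a plain String, the result can be handled with standard Java APIs. The following is a minimal sketch of saving extracted text to a UTF-8 file. The class ExtractedTextWriter and its save method are hypothetical names introduced for illustration, and the sample string stands in for a real getText result. It requires Java 11 or later for Files.writeString.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExtractedTextWriter {
    // Hypothetical helper: writes the extracted text to the target file as UTF-8
    // and returns the path that was written.
    static Path save(String extractedText, Path target) throws IOException {
        return Files.writeString(target, extractedText, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder standing in for the String returned by getText().
        String extractedText = "Sample text from a PDF page";
        Path target = Files.createTempFile("extracted-", ".txt");
        save(extractedText, target);
        // Read the file back to confirm what was stored.
        System.out.println(Files.readString(target, StandardCharsets.UTF_8));
    }
}
```

Writing with an explicit charset keeps the output independent of the platform default encoding.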
